
Add ControllerStartupLatency metric for SandboxClaims #522

Merged
k8s-ci-robot merged 3 commits into kubernetes-sigs:main from igooch:sandboxclaim-controller-latency
Apr 6, 2026

Conversation

@igooch
Contributor

@igooch igooch commented Apr 4, 2026

This PR introduces a new metric, agent_sandbox_claim_controller_startup_latency_ms, to provide higher precision tracking of SandboxClaim startup performance.

Problem

Currently, startup latency is measured using the standard Kubernetes creationTimestamp. However, this timestamp has one-second granularity. For fast-provisioning resources like SandboxClaims, where target latencies are often in the millisecond range, this granularity is too coarse and leads to inaccurate P50/P90 metrics.

Proposed Solution

The controller now stamps a high-precision controller-first-observed-at annotation during its first reconciliation cycle. The new metric measures the duration from this observation point to the "Ready" state.

Notes for the reviewer

  • Measures Controller-Observed Latency: This tracks the duration from the controller's first observation to the "Ready" state, rather than total client-perceived creation time. (A separate SDK metric will be created to track the full client-to-Ready latency).
  • Excludes Pre-Reconciliation Overhead: It omits initial API server processing, watch latency, and workqueue delays occurring before the first reconciliation cycle. This makes it a "partial" server-side metric focused strictly on controller performance.
  • Requires Inline Patching: Recording the high-precision timestamp adds an extra API call (inline patch) during the first reconciliation. To minimize API overhead, this is bundled with the tracing annotation patch whenever tracing is enabled.
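The bundling mentioned in the last note could look roughly like the sketch below: all pending annotation changes are collected into one JSON merge patch body, so a single API call carries both the tracing annotation and the observation timestamp. The helper name and the annotation keys are hypothetical, and how the patch is actually sent is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildAnnotationPatch returns a JSON merge patch body that sets every
// annotation in the map, or nil when there is nothing to patch.
func buildAnnotationPatch(annotations map[string]string) ([]byte, error) {
	if len(annotations) == 0 {
		return nil, nil
	}
	patch := map[string]map[string]map[string]string{
		"metadata": {"annotations": annotations},
	}
	return json.Marshal(patch)
}

func main() {
	// Placeholder keys and values; the real ones live in the controller code.
	pending := map[string]string{
		"example.invalid/trace-context":                "tc",
		"example.invalid/controller-first-observed-at": "2026-04-04T12:00:00.123456789Z",
	}
	body, err := buildAnnotationPatch(pending)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```

When tracing is disabled, the same helper would produce a patch with only the timestamp annotation, which is the extra API call the note refers to.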

@netlify

netlify bot commented Apr 4, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit e193293
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69d3ea73be3a320008fd655a

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 4, 2026
}
claim.Annotations[asmetrics.TraceContextAnnotation] = tc
if needObsPatch {
	claim.Annotations[obsAnnotation] = time.Now().Format(time.RFC3339Nano)
Contributor

You might consider just keeping an in memory map, but ... given we're already writing to the apiserver for the trace... sgtm

Contributor Author

I ran the latency test using a sync.Map to keep track of the times instead of this patch update, adding the values to the map with LoadOrStore at the beginning of the Reconcile loop and deleting them when the metric is recorded, to keep the map from growing indefinitely. It actually ran slower than sending a patch update to the apiserver. Granted, I ran this test both times with tracing turned on, so I can test that again without tracing.

@aditya-shantanu
Contributor

Thanks Ivy.

I think we also need a version of the original metric that optionally reads a client-provided timestamp from a predefined annotation.
If that timestamp isn't set, we use now(), and the metric effectively "falls back" to this one. Thoughts?

OR

We can skip emitting the claim_latency_metric if that annotation is not set.

@igooch igooch force-pushed the sandboxclaim-controller-latency branch from f750786 to 24ffa48 Compare April 6, 2026 16:45
@aditya-shantanu
Contributor

/lgtm
/approve
/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Apr 6, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aditya-shantanu, igooch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 6, 2026
@igooch igooch force-pushed the sandboxclaim-controller-latency branch from 0638598 to e193293 Compare April 6, 2026 17:16
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 6, 2026
Contributor

@aditya-shantanu aditya-shantanu left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2026
@k8s-ci-robot k8s-ci-robot merged commit 4eceb03 into kubernetes-sigs:main Apr 6, 2026
10 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.


4 participants